Kazutoshi KOBAYASHI Kazuhiko TERADA Hidetoshi ONODERA Keikichi TAMARU
We propose a real-time low-rate video compression algorithm using fixed-rate multi-stage hierarchical vector quantization. Vector quantization is suitable for mobile computing, since it requires little computation for decoding. The proposed algorithm enables transmission of 10 QCIF frames per second over a low-rate 29.2 kbps mobile channel. A frame is hierarchically divided into sub-blocks, and every frame is compressed at a fixed rate regardless of video activity. For active frames, mainly large sub-blocks giving low resolution are transmitted. For inactive frames, smaller sub-blocks giving higher resolution can be transmitted successively after a motion-compensated frame. We have developed a compression system consisting of a host computer and a memory-based processor for the nearest neighbor search in VQ. Our algorithm guarantees real-time decoding even on a low-performance CPU.
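As a purely illustrative sketch of the fixed-rate idea described above (with hypothetical block sizes and bit costs; the authors' actual coder is not reproduced here), sub-blocks can be transmitted largest first until the per-frame bit budget is exhausted:

# Minimal sketch of fixed-rate hierarchical sub-block selection (hypothetical
# parameters; the actual coder described in the paper differs in detail).

def select_blocks(blocks, bits_per_block, frame_budget_bits):
    """Pick sub-blocks, largest first, until the fixed per-frame budget is used.

    blocks: list of (size, block_id) tuples, e.g. size 16 for a 16x16 block.
    Returns the ids of the blocks that fit into this frame.
    """
    chosen = []
    spent = 0
    # Large blocks (coarse resolution) come first; smaller refinement blocks
    # follow on inactive frames when budget remains.
    for size, block_id in sorted(blocks, key=lambda b: -b[0]):
        cost = bits_per_block[size]
        if spent + cost > frame_budget_bits:
            continue
        chosen.append(block_id)
        spent += cost
    return chosen

if __name__ == "__main__":
    blocks = [(16, "B0"), (8, "B1"), (8, "B2"), (4, "B3")]
    print(select_blocks(blocks, {16: 600, 8: 300, 4: 150}, 1200))  # ['B0', 'B1', 'B2']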
Takashi YUKAWA Sanda M. HARABAGIU Dan I. MOLDOVAN
This paper presents an algorithm for viewpoint-based similarity discernment of linguistic concepts on the Semantic Network Array Processor (SNAP). Viewpoint-based similarity discernment plays a key role in retrieving similar propositions, which is useful for advanced knowledge-processing areas such as analogical reasoning and case-based reasoning. The algorithm assumes that a knowledge base is constructed for SNAP, based on information acquired from the WordNet linguistic database. The algorithm identifies paths in the knowledge base between each given concept and a given viewpoint concept, and then computes a similarity degree between the two concepts based on the number of nodes shared by the paths. A small-scale knowledge base was constructed and an experiment was conducted on a SNAP simulator that demonstrated the feasibility of this algorithm. Because of SNAP's scalability, the algorithm is expected to work similarly on a large-scale knowledge base.
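A minimal sketch of the path-overlap idea, assuming a toy concept graph and a simple shared-node score (the SNAP marker-propagation mechanics and the paper's exact similarity degree are not reproduced here):

# Sketch of viewpoint-based similarity over a toy concept graph.
from collections import deque

def shortest_path(graph, start, goal):
    """BFS shortest path in an undirected concept graph given as an adjacency dict."""
    prev = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in graph.get(node, ()):
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return []

def viewpoint_similarity(graph, concept_a, concept_b, viewpoint):
    """Similarity grows with the number of nodes shared by the two
    concept-to-viewpoint paths (a simple proxy for the paper's degree)."""
    path_a = set(shortest_path(graph, concept_a, viewpoint))
    path_b = set(shortest_path(graph, concept_b, viewpoint))
    if not path_a or not path_b:
        return 0.0
    return len(path_a & path_b) / max(len(path_a), len(path_b))

if __name__ == "__main__":
    g = {"dog": ["canine"], "canine": ["dog", "mammal"], "cat": ["feline"],
         "feline": ["cat", "mammal"], "mammal": ["canine", "feline", "animal"],
         "animal": ["mammal"]}
    print(viewpoint_similarity(g, "dog", "cat", "animal"))  # 0.5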
Hiroyuki KURINO Keiichi HIRANO Taizo ONO Mitsumasa KOYANAGI
We describe a new multiport memory called a Shared DRAM (SHDRAM) that overcomes the bus-bottleneck problem in shared-memory parallel processor systems. The processors are connected directly to the SHDRAM without a conventional common bus. A test chip with 32 kbit of memory cells was fabricated in a 1.5 µm CMOS technology. The basic operation is confirmed by circuit simulation and experimental results. In addition, computer simulation confirms that the performance of a system using the SHDRAM is superior to that of a system using a conventional common bus.
Akihiko HASHIGUCHI Masuyoshi KUROKAWA Ken'ichiro NAKAMURA Hiroshi OKUDA Koji AOYAMA Mitsuharu OHKI Katsunori SENO Ichiro KUMATA Masatoshi AIKAWA Hirokazu HANAKI Takao YAMAZAKI Mitsuo SONEDA Seiichiro IWASE
A programmable DSP with a linear array architecture for real-time video processing is reported. It achieves a processing rate of 5.4 GOPS and a memory bandwidth of 81 GB/s using a Dual Sense Amplifier architecture. A low-power-supply pipeline decreases power consumption and a time-shared bit-line reduces chip area. It has 4320 processor elements and a 1.1 Mbit 3-port memory. The DSP can be applied to HDTV signals with its 75 MHz peak I/O rate. Sufficient programmability is provided to execute video format conversion such as image size conversion and Y/C separation, and picture quality improvement such as noise reduction and image enhancement. The chip was fabricated in a 0.4 µm CMOS triple-metal technology with a 15.12 mm × 14.95 mm die. It operates at 50 MHz and consumes 0.53 W/GOPS at 3.3 V.
Kazutoshi KOBAYASHI Noritsugu NAKAMURA Kazuhiko TERADA Hidetoshi ONODERA Keikichi TAMARU
We have developed and fabricated an LSI called the FMPP-VQ64. The LSI is a memory-based shared-bus SIMD parallel processor containing 64 PEs, intended for low bit-rate image compression using vector quantization. It accelerates the nearest neighbor search (NNS) during vector quantization, and its computation time does not depend on the number of code vectors. The FMPP-VQ64 performs 53,000 NNSs per second with a power dissipation of 20 mW. It can be applied to mobile telecommunication systems.
Jun TAKEDA Ken-ichi TANAKA Kazuo KYUMA
An image recognition system using NEURO4, a programmable parallel processor, is described. Optical flow is the velocity field that an observer detects on a two-dimensional image; it gives useful information, such as edges, about moving objects. The processing time for detecting optical flow on the NEURO4 system was analyzed. Owing to the parallel computation scheme, the processing time on the NEURO4 system is proportional to the square root of the image size, whereas conventional sequential computers need time proportional to the image size. This analysis was verified by experiments on the NEURO4 system. For an 84 × 84 image, the NEURO4 system can detect optical flow in less than 10 seconds; in this case it is 23 times faster than a workstation, a Sparc Station 20 (SS20). As the image size grows, the speed advantage of the NEURO4 system over conventional sequential computers such as the SS20 increases. Furthermore, the parallelization effect increases in proportion to the number of NEURO4 chips connected by a ring expansion scheme. The NEURO4 system is therefore useful for developing moving-image recognition algorithms that require a large amount of processing time.
Dingchao LI Yuji IWAHORI Naohiro ISHII
Parallelism on heterogeneous machines brings cost effectiveness, but also raises a new set of complex and challenging problems. This paper addresses the problem of estimating the minimum time taken to execute a program on a fine-grained parallel machine composed of different types of processors. In an earlier publication, we took the first step in this direction by presenting a graph-construction method which partitions a given program into several homogeneous parts and incorporates timing constraints due to heterogeneous parallelism into each part. In this paper, to make the method easier to apply in a scheduling framework and to demonstrate its practical utility, we present an efficient implementation method and compare the results of its use with the optimal schedule lengths obtained by enumerating all possible solutions. Experimental results for several different machine models indicate that this method can be effectively used to estimate a program's minimum execution time.
Mitsuru MARUYAMA Naohisa TAKAHASHI Takeshi MIEI Tsuyoshi OGURA Tetsuo KAWANO Satoru YAGI
A parallel IP router that uses off-the-shelf workstations and interconnecting switches is presented. This router, called CORErouter-I, is a medium-grained, functionally distributed parallel system consisting of four kinds of processors for routing, routing-table searching, servicing, and line interfacing. Also discussed are issues related to the implementation of CORErouter-I, especially in terms of routing protocol processing and packet-forwarding. Performance characteristics of CORErouter-I are also clarified through several experiments performed to evaluate maximum throughput, analyze packet-forwarding time, and estimate the effect of parallel processing on the route-flapping problem.
Shinhaeng LEE Shin'ichiro OMACHI Hirotomo ASO
Linear programming techniques are useful in many diverse applications such as production planning and energy distribution. Finding an optimal solution of a linear programming problem requires repeated computations and takes a lot of processing time, so special-purpose hardware for high-speed linear programming has been sought. This paper proposes a systolic array for solving linear programming problems using the revised simplex method, a typical algorithm of linear programming. This paper also proposes a modified systolic array that can solve very large linear programming problems.
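For reference, the per-iteration quantities of the textbook revised simplex method are summarized below (for a maximization problem in standard equality form; the paper's systolic mapping of these steps is not shown here):

% Revised simplex iteration for  max c^T x  s.t.  Ax = b, x >= 0,
% with basis matrix B and nonbasic columns N (textbook form).
\begin{align*}
  \text{basic solution:}      &\quad x_B = B^{-1} b,\\
  \text{simplex multipliers:} &\quad y^{T} = c_B^{T} B^{-1},\\
  \text{reduced costs:}       &\quad \bar{c}_N^{T} = c_N^{T} - y^{T} N,\\
  \text{entering column:}     &\quad \text{choose } q \text{ with } \bar{c}_q > 0,\quad d = B^{-1} A_q,\\
  \text{ratio test:}          &\quad \theta = \min_{i:\, d_i > 0} \frac{(x_B)_i}{d_i},
\end{align*}
and the leaving variable attaining $\theta$ is exchanged with $x_q$ to update the basis $B$.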
Hitoshi YAMAUCHI Takayuki MAEDA Hiroaki KOBAYASHI Tadao NAKAMURA
The multipass rendering method based on the global illumination model can generate the most photo-realistic images. However, since the multipass rendering method is very time-consuming, it is impractical in the industrial world. This paper discusses a massively parallel processing approach to fast image synthesis by the multipass rendering method. In particular, we focus on the performance evaluation of view-dependent object-space parallel processing on the (Mπ)2, which was proposed in our previous paper. We also propose two kinds of distributed frame buffer system, named the cached frame buffer and the multistage-interconnected frame buffer, which solve the access-conflict problem on the frame buffer. The simulation results show that the (Mπ)2 has scalable performance; for example, the (Mπ)2 with more than 4000 processing elements can achieve an efficiency of over 50%. We also show that both of the proposed distributed frame buffer systems can relieve the overhead due to frame buffer access in the (Mπ)2 when a large number of high-performance processing elements are adopted in the system.
Kazutoshi KOBAYASHI Masayoshi KINOSHITA Hidetoshi ONODERA Keikichi TAMARU
We propose a memory-based processor called a Functional Memory Type Parallel Processor for vector quantization (FMPP-VQ). The FMPP-VQ is intended for low bit-rate image compression using vector quantization, and it accelerates the nearest neighbor search in vector quantization. In the nearest neighbor search, the code vector nearest to an input vector is sought among a large number of code vectors. The FMPP-VQ has as many PEs (processing elements, also called "blocks") as code vectors, so the distances between an input vector and all code vectors are computed simultaneously, one in each PE. The minimum of all the distances is then found in parallel, as in conventional CAMs, and the computation time does not depend on the number of code vectors. In this paper we describe the architecture of the FMPP-VQ in detail, together with its performance and layout density. We designed and fabricated an LSI including four PEs; the test results and performance estimation of the LSI are also reported.
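The following serial sketch reproduces the result of the search that the FMPP-VQ performs in parallel; the distance measure used here (sum of absolute differences) is an assumption and may differ from the chip's actual metric:

# Functional sketch of the nearest neighbor search parallelized by the FMPP-VQ
# (serial emulation; on the chip every PE holds one code vector, all distances
# are formed at once, and the global minimum is found in parallel as in a CAM).

def nearest_code_vector(input_vec, code_vectors):
    """Return the index of the code vector closest to input_vec."""
    best_index, best_dist = -1, float("inf")
    for i, cv in enumerate(code_vectors):
        # Assumed distance: sum of absolute differences between components.
        dist = sum(abs(a - b) for a, b in zip(input_vec, cv))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index

if __name__ == "__main__":
    codebook = [[0, 0, 0, 0], [8, 8, 8, 8], [16, 16, 16, 16]]
    print(nearest_code_vector([7, 9, 8, 6], codebook))  # -> 1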
Takeshi OGURA Mamoru NAKANISHI
This paper describes content addressable memory (CAM)-based hardware that serves as a highly parallel, compact, real-time image-processing system. The novel concept of a highly-parallel integrated circuit and system (HiPIC), in which a large-capacity CAM tuned for parallel data processing is a key element, is introduced. Several hardware algorithms for highly parallel image processing based on a HiPIC with a CAM are presented to demonstrate that the HiPIC concept is effective for compact, real-time image processing. Two kinds of HiPIC-dedicated CAM have been developed. One is embedded on a 0.5-µm CMOS gate array, on which an embedded CAM of up to 64 kbit and logic of up to 40 kgate can be integrated on a single chip. The other is a 0.5-µm CMOS full-custom CAM LSI tuned for parallel data processing; a fully-parallel 336-kbit CAM LSI has been successfully developed. The HiPIC concept and the CAM-based hardware described here promise to be an important step towards the realization of a compact, real-time image-processing system.
Takafumi AOKI Shinichi SHIONOYA Tatsuo HIGUCHI
This paper explores the potential of multiwave interconnections (optical interconnections that employ wavelength components as multiplexable information carriers) for constructing next-generation multiprocessor systems using MCM technology. A hypercube-based multiprocessor network called the multiwave hypercube (MWHC) is proposed, in which multiwave interconnections provide highly flexible dynamic communication channels among processing elements. A performance analysis shows that the use of multiwavelength optics makes it possible to reduce network complexity on an MCM substrate while supporting low-latency message routing.
Shinji KOMORI Yutaka ARIMA Yoshikazu KONDO Hirono TSUBOTA Ken-ichi TANAKA Kazuo KYUMA
We have developed an SIMD-type neural-network processor (NEURO4) and its software environment. With the SIMD architecture, the chip executes 24 operations in a clock cycle and achieves 1.2 GFLOPS peak performance. An accelerator board containing four NEURO4 chips achieves 3.2 GFLOPS. In this paper we describe the features of the neural network chip, the accelerator board, and the software environment, and present a performance evaluation for several neural network models (LVQ, BP, and Hopfield). The 3.2 GFLOPS neural network accelerator board demonstrates 1.7 GCPS and 261 MCUPS for Hopfield networks.
Dingchao LI Akira MIZUNO Yuji IWAHORI Naohiro ISHII
This paper describes a new approach to the scheduling problem of assigning the tasks of a parallel program, described as a task graph, onto parallel machines. The approach handles interprocessor communication and heterogeneity by using both the theoretical results developed so far and a lookahead scheduling strategy. Experimental results on randomly generated task graphs demonstrate the effectiveness of this scheduling heuristic.
Shietung PENG Igor SEDUKHIN Stanislav SEDUKHIN
In this paper the design of systolic array processors for computing the 2-dimensional Discrete Fourier Transform (2-D DFT) is considered. We investigated three different computational schemes for designing systolic array processors using a systematic approach. The systematic approach guarantees finding, from a large solution space, systolic array processors that are optimal in terms of the number of processing elements and I/O channels, processing time, topology, pipeline period, etc. The optimal systolic array processors are scalable, modular, and suitable for VLSI implementation. An application of the designed systolic array processors to the prime-factor DFT is also presented.
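For reference, one common computational scheme, the row-column (separable) decomposition of the 2-D DFT, is shown below; the paper itself compares three schemes:

% N1 x N2 point 2-D DFT in separable (row-column) form, with W_N = e^{-j 2\pi / N}.
\begin{align*}
  X(k_1,k_2) &= \sum_{n_1=0}^{N_1-1}\sum_{n_2=0}^{N_2-1}
                x(n_1,n_2)\, W_{N_1}^{n_1 k_1} W_{N_2}^{n_2 k_2} \\
             &= \sum_{n_1=0}^{N_1-1}
                \Bigl[\sum_{n_2=0}^{N_2-1} x(n_1,n_2)\, W_{N_2}^{n_2 k_2}\Bigr]
                W_{N_1}^{n_1 k_1},
\end{align*}
so the transform decomposes into 1-D DFTs along one dimension followed by the other, a structure that maps naturally onto arrays of processing elements.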
Akimasa YOSHIDA Ken'ichi KOSHIZUKA Wataru OGATA Hironori KASAHARA
This paper proposes a data-localization scheduling scheme inside a processor cluster for multigrain parallel processing, which hierarchically exploits parallelism among coarse-grain tasks such as loops, medium-grain tasks such as loop iterations, and near-fine-grain tasks such as statements. The proposed scheme assigns near-fine-grain or medium-grain tasks inside coarse-grain tasks onto processors inside a processor cluster so that maximum parallelism can be exploited and inter-processor data transfer is minimized after data localization for coarse-grain tasks across processor clusters. Performance evaluation on the multiprocessor system OSCAR shows that multigrain parallel processing with the proposed data-localization scheduling can reduce the execution time of application programs by 10% compared with multigrain parallel processing without data localization.
Hideyuki ITO Kouichi NAGAMI Tsunemichi SHIOZAWA Kiyoshi OGURI Yukihiro NAKAMURA
We are working on an algorithm to optimize logic circuits so that they can be realized on a super fine-grain parallel processing architecture. As a part of this work, we have developed an inverter reduction algorithm based on modeling logic circuits as dynamical systems. We implemented the algorithm in the PARTHENON system, a high-level synthesis system developed in NTT's laboratories, and evaluated it using the ISCAS85 benchmarks. We also compare the results with those of both the existing algorithm of PARTHENON and the algorithm of Jain and Bryant.
Yoshihiko UEMATSU Koichi MURATA Shinji MATSUOKA
This paper proposes a parallel word alignment procedure for the m Binary with 1 Complement Insertion (mB1C) and Differential m Binary with 1 Mark Insertion (DmB1M) line codes. In the proposed procedure for the mB1C line code, the word alignment circuit searches (m+1) bit pairs in parallel for complementary relationships. A signal flow graph model for the parallel word alignment procedure is also proposed, and its performance attributes are numerically analyzed. The attributes are compared with those of the conventional bit-by-bit procedure, and it is shown that the proposed procedure displays superior performance in terms of false-alignment probability and maximum average aligning time. The proposed procedure is suitable for high-speed optical data links, because it can be easily implemented using a parallel signal processor operating at a clock rate equal to 1/(m+1) times the mB1C line rate.
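A minimal software sketch of the alignment idea follows, assuming that each (m+1)-bit word ends with the complement of its last data bit (the real circuit checks all m+1 candidate phases simultaneously, and the exact mB1C framing may differ):

# Sketch of word alignment for an mB1C-like stream: find the bit phase at
# which the complementary bit-pair relationship holds for every observed word.
import random

def find_alignment(bits, m, words_to_check=16):
    """Return the phase 0..m whose bit pair is complementary in every word
    of the observation window, or None if no phase qualifies."""
    period = m + 1
    for phase in range(period):
        ok = True
        for w in range(words_to_check):
            i = phase + w * period
            if i + 1 >= len(bits):
                break
            # Complementary pair: a data bit followed by its inverted copy.
            if bits[i + 1] != 1 - bits[i]:
                ok = False
                break
        if ok:
            return phase
    return None

if __name__ == "__main__":
    m = 4
    stream = []
    for _ in range(20):
        word = [random.randint(0, 1) for _ in range(m)]
        stream.extend(word + [1 - word[-1]])  # append the complement bit
    print(find_alignment(stream, m))  # expected phase: m - 1 = 3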
Shoji KAWAHITO Makoto YOSHIDA Yoshiaki TADOKORO Akira MATSUZAWA
This paper presents an analog 2-dimensional discrete cosine transform (2-D DCT) processor for focal-plane image compression. The on-chip analog 2-D DCT processor can directly process the analog signal of a CMOS image sensor. The analog-to-digital conversion (ADC) is performed after the 2-D DCT, which leads to efficient AD conversion of video signals: most of the 2-D DCT coefficients can be digitized by a relatively low-resolution ADC or a zero detector, and the quantization after the 2-D DCT can be carried out by the ADC at the same time. The 8×8-point analog 2-D DCT processor is designed with switched-capacitor (SC) coefficient multipliers and an SC analog memory based on 0.35 µm CMOS technology. The 2-D DCT processor has sufficient precision, high processing speed, low power dissipation, and small silicon area. The resulting smart image sensor chips with data compression and digital transmission functions are useful for high-speed image acquisition devices and portable digital video camera systems.
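For reference, the standard N × N 2-D DCT evaluated block by block (N = 8 here) is:

% Standard N x N 2-D DCT (N = 8), separable into row and column 1-D DCTs.
\[
  X(u,v) = \frac{2}{N}\, C(u)\, C(v)
           \sum_{x=0}^{N-1}\sum_{y=0}^{N-1} f(x,y)\,
           \cos\!\frac{(2x+1)u\pi}{2N}\,
           \cos\!\frac{(2y+1)v\pi}{2N},
  \qquad
  C(k) = \begin{cases} 1/\sqrt{2}, & k = 0,\\ 1, & \text{otherwise}. \end{cases}
\]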